
Linguistic analysis plus Conventionality

  Throughout the previous sections, I have emphasised the importance of lexical generativity for advanced interpretation in NLP tasks, and for accommodating meaning ambiguities. Although rich lexical structure is not required for all NLP tasks, it will become increasingly important as NLP systems aim to extend their capabilities to handle more general natural language understanding and generation. We have seen that the design of the lexicon must be informed by linguistic analysis, to identify regularities in sense extensions and to capture syntactic and semantic generalisations associated with particular groups of words.

Following Atkins (1991) and Atkins and Levin (1991), I would like to suggest that an adequate computational lexicon can only be established on the basis of top-down design derived from a linguistic theory, in combination with bottom-up information about specific language usage derived from corpora. Such an approach is supported by evidence that the probabilistic model of Bruce and Wiebe (1995) improves in accuracy when augmented with analytical (theoretically derived) knowledge. The information derived from corpora might include, as suggested by, inter alia, Krovetz (1991), sense frequency information, co-occurrence relations and collocations. It should also include idioms and representations of proper nouns, which establish contexts in which a word can take on non-compositional meanings.

The notion of harnessing linguistically-derived insights to aid lexicon design and automatic lexical acquisition has also been convincingly advocated by Light (1996), who shows that surface cues, such as morphological features of a word, can have consistent correspondences to lexical semantic features associated with that word (or its base form). For example, the prefix un- applied to a verb (e.g. unlatch, unhinge) signals that the verb is a member of the telic aspectual class. Such correspondences, once identified on the basis of theoretical research, can be utilised to establish lexical semantic structures for words through corpus analysis. Light demonstrates the utility of morphological cues for identifying a variety of lexical semantic properties, from aspectual class to general semantic relation (e.g. change-of-state-rel) to antonymy. Corpus analysis driven by correspondences between surface cues and lexical semantic features can clearly play a useful role in automatic lexicon acquisition, but it relies on linguistic observations of those correspondences.
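As a rough illustration of how such a cue might be applied, the following minimal sketch proposes telic-class candidates from the un- prefix. It is not Light's procedure: the function name `telic_candidates`, the toy verb lists, and the requirement that the stripped base form be a known verb (to exclude words such as understand or unite) are all assumptions made for the example.

```python
def telic_candidates(corpus_verbs, known_verbs):
    """Propose verbs of the form un- + known verb as telic candidates.

    Both arguments are hypothetical inputs: corpus_verbs is a list of verb
    lemmas seen in a corpus, known_verbs a set of verbs already in the lexicon.
    """
    candidates = set()
    for verb in corpus_verbs:
        base = verb[2:]
        # Require the remainder to be a known verb, so that accidental
        # "un" onsets (understand, unite) are not treated as the prefix.
        if verb.startswith("un") and base in known_verbs:
            candidates.add(verb)
    return candidates


known_verbs = {"latch", "hinge", "lock", "tie"}
corpus_verbs = ["unlatch", "unhinge", "understand", "unite", "lock"]
print(sorted(telic_candidates(corpus_verbs, known_verbs)))
# prints ['unhinge', 'unlatch']
```

The output is only a set of candidates; whether each one is in fact telic still rests on the linguistic observation that motivates the cue.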

The linguistic analysis of logical metonymy in Chapter 5 identified certain semantic information which would need to be represented in the lexicon in order to model the conventionality of the phenomenon accurately while still capturing a generalisation about how logical metonymy takes place. To acquire the appropriate representation automatically, corpora would need to be analysed for evidence of specific components of qualia structure. This corpus analysis would very clearly have to be guided by the linguistic theory underlying the explanatory model, including assumptions about the generative devices encoded in the lexicon (e.g. Pustejovsky 1991, 1995a), since the results of the acquisition depend on a particular view of the processes involved in logical metonymy and a particular view of the kind of lexical structure associated with nouns.

Let us consider how the automatic acquisition of the knowledge relevant to logical metonymy might proceed, given the theoretical analysis in Chapter 5 which assumes that logical metonymy always occurs with respect to either the agentive or telic roles of a noun, but that these roles are not represented in the lexical entry of every noun. Although a certain amount of the work of acquiring qualia structure can apparently proceed via automatic means, some of it still must be built up by hand due to the interpretation required to establish whether or not the telic role should be represented for a particular noun, as will be pointed out in step (22c) below.

  1. The values of potential agentive and telic roles must be identified for every artifact-referring noun. This involves identifying the verbal relations in which the noun most frequently plays a role. Two particular kinds of verbal relations are of most interest:
    1. The agentive role of a noun is likely to be the most frequent occurrence of a creation verb in which the noun is the created entity. For example, bake would be assumed to be the value of the agentive role for cake if bake a/the cake appears more frequently in the corpus than any other creation activity involving cake.
    2. The telic role of a noun is likely to be the most frequent occurrence of any non-creation verb in which the noun plays a non-agentive role. For example, read would be assumed to be the value of the telic role for book if read a/the book appears more frequently in the corpus than any other non-creation activity involving book.
  2. Instances of logical metonymies must be identified and analysed.  
    1. Pick out instances of an aspectual verb or metonymic adjective followed by a noun (phrase) which does not refer to an event. In the case of aspectual verbs, this process must be restricted to instances in which the noun phrase is a complement of the verb and must therefore occur after deep parsing has established the structure of the sentence in which the verb appears.
    2. For those nouns which don't participate in logical metonymies in the corpus, propose that their telic role is not accessible to the process of logical metonymy, and that therefore their telic role should not be lexically represented.
    3. For those nouns which participate in logical metonymies in the corpus, attempt to identify whether the logical metonymies are agentive role-centred or telic role-centred, i.e. whether the elided event is a creation or a non-creation event. How this portion of the analysis could proceed automatically is not clear to me, as it involves extensive context-dependent interpretation and therefore would involve the full power of an NLU system. However, the preceding stages will have identified the relevant set of examples in the corpus, which is likely to be limited to a small set of nouns (as evidenced by the small range of possibilities found in my corpus analysis in Chapter 5). As soon as a single non-creation metonymy involving a certain noun is found, it should be assumed that the telic role for that noun is represented and the next noun can be considered.
  3. Add the potential agentive role to the lexical entry for each noun; add the potential telic role to the lexical entry only if evidence for doing so was found in the logical metonymy data.
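The frequency-based parts of the procedure above (steps 1 and 3) can be sketched as follows. This is a minimal illustration under stated assumptions, not an implementation of the full procedure: the verb-object pairs are assumed to come from an already parsed corpus, the set of creation verbs is assumed to be supplied by hand, and the set of nouns with telic-centred metonymies stands in for the manual judgement required in step 2. All names and the toy data are hypothetical.

```python
from collections import Counter


def propose_roles(verb_object_pairs, creation_verbs):
    """Step 1: for each noun, take the most frequent creation verb as the
    potential agentive role and the most frequent other verb as the
    potential telic role."""
    counts = {}
    for verb, noun in verb_object_pairs:
        counts.setdefault(noun, Counter())[verb] += 1

    roles = {}
    for noun, verb_counts in counts.items():
        ranked = verb_counts.most_common()  # verbs by descending frequency
        creation = [v for v, _ in ranked if v in creation_verbs]
        other = [v for v, _ in ranked if v not in creation_verbs]
        roles[noun] = {
            "agentive": creation[0] if creation else None,
            "telic": other[0] if other else None,
        }
    return roles


def build_entries(roles, telic_metonymy_nouns):
    """Step 3: always record the agentive role; record the telic role only
    for nouns with telic-centred metonymy evidence (judged by hand here)."""
    entries = {}
    for noun, role in roles.items():
        entry = {"agentive": role["agentive"]}
        if noun in telic_metonymy_nouns:
            entry["telic"] = role["telic"]
        entries[noun] = entry
    return entries


# Invented toy counts; not drawn from any real corpus.
pairs = (
    [("bake", "cake")] * 3 + [("make", "cake")] + [("eat", "cake")] * 2
    + [("write", "book")] * 2 + [("read", "book")] * 3 + [("buy", "book")]
)
creation_verbs = {"bake", "make", "write"}  # assumed hand-built list
roles = propose_roles(pairs, creation_verbs)
entries = build_entries(roles, telic_metonymy_nouns={"book"})
print(entries["book"])  # {'agentive': 'write', 'telic': 'read'}
print(entries["cake"])  # {'agentive': 'bake'}
```

The toy evidence set makes no claim about which nouns actually license telic-centred metonymies; it only shows how the outcome of step 2 gates what is written to the lexical entry.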

What do the needs of the process described above tell us about the framework which must already be in place before this specific corpus analysis can proceed?

We would also like to use corpora to identify the frequency with which a certain word undergoes a potential alternation, as suggested by Copestake and Briscoe (1995). For sense extensions which have no syntactic reflexes, this is a virtually impossible task, even in corpora that have been processed for syntactic structure, because there will be no basis for distinguishing one sense from another in the corpus. However, many sense extensions do have syntactic effects, and therefore a parsed corpus can provide the basis for identifying the frequency of some of the different senses of a word.

The addition of rudimentary semantic tagging to the corpus would also aid in calculating the frequency of various sense extensions, particularly if the lexicon is augmented to include certain selectional restrictions. For example, the verb eat would likely specify that its eaten complement is foodstuff or something similar. In the context of eat, then, a noun phrase like the lamb (e.g. John ate the lamb) would be interpreted under its meat sense rather than its animal sense. This kind of information could guide the identification of a use of a word with a particular sense.
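The eat/lamb example can be made concrete with a small sketch of sense filtering by selectional restriction. The dictionaries `SENSES` and `SELECTIONAL_RESTRICTIONS`, the semantic type labels, and the function `select_senses` are hypothetical, introduced only for illustration; a real lexicon would encode this information quite differently.

```python
# Hypothetical sense inventory: each sense of a noun lists its semantic types.
SENSES = {
    "lamb": {"animal": {"animate"}, "meat": {"foodstuff"}},
}

# Hypothetical selectional restrictions on a verb's object.
SELECTIONAL_RESTRICTIONS = {
    "eat": "foodstuff",
}


def select_senses(verb, noun):
    """Return the senses of noun compatible with the verb's restriction.

    If the verb imposes no known restriction, all senses remain candidates.
    """
    senses = SENSES.get(noun, {})
    required = SELECTIONAL_RESTRICTIONS.get(verb)
    if required is None:
        return sorted(senses)
    return sorted(s for s, types in senses.items() if required in types)


print(select_senses("eat", "lamb"))  # ['meat']
print(select_senses("see", "lamb"))  # ['animal', 'meat']
```

Only occurrences for which a single sense survives the filter could be counted towards the frequency of a sense extension; the rest would remain ambiguous without further context.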

What does the previous discussion tell us about what the corpus needs to look like in order to support the desired processing? Most corpora in existence have at most part-of-speech tagging (e.g. the BNC) resulting from shallow parsing. They can be useful for identifying collocations and general co-occurrence frequencies. However, in order to identify semantic relationships, the corpus must be given more structure. Specifically, I suggest the following desiderata:

In conclusion, the extraction of information useful to advanced NLP tasks from a corpus demands a certain level of linguistic sophistication both from the corpus and from the framework which drives the corpus analysis. This information will ultimately be necessary in order for computational systems to achieve the capability to handle the problems posed by polysemy and the creativity of language use.

